Untangling Text Data Mining
Author
Abstract
The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. Perhaps for this reason, there has been little work in text data mining to date, and most people who have talked about it have either conflated it with information access or have not made use of text directly to discover heretofore unknown information. In this paper I will first define data mining, information access, and corpus-based computational linguistics, and then discuss the relationship of these to text data mining. The intent behind these contrasts is to draw attention to exciting new kinds of problems for computational linguists. I describe examples of what I consider to be real text data mining efforts and briefly outline our recent ideas about how to pursue exploratory data analysis over text.
Similar resources
Differentiating between data-mining and text-mining terminology
When a new discipline emerges, it usually takes some time and a great deal of academic discussion before concepts and terms become standardized. Text mining is one such new discipline. In a groundbreaking article, Untangling text data mining, Hearst (1999) tackled the problem of clarifying text-mining concepts and terminology. This article is aimed at building on Hearst's ideas by pointing out ...
Untangling Regulatory Text: Multidimensional Separation of Concerns and Task-Oriented Linking
Regulatory text is a complex network of information distinguished by technical language, large volume, and tangled concepts formed by embedded cross-references. Scattering and tangling add to the cognitive pressures on users of regulatory text. Multidimensional Separation of Concerns (MDSOC) is a software engineering method aimed at untangling source code objects that contain cross-cutting, ove...
Paraphrasing Spoken Japanese for Untangling Bilingual Transfer
One of the problems in spoken language translation is the enormous variety of expressions not found in text translation. This volume can lead to sparse translation coverage. In order to tackle this problem, we take the practical approach of untangling slight variations in the source language before transferring a source expression to its target. We therefore discuss how effective paraphrasing ...
Using Motion Planning for Knot Untangling
In this paper we investigate the application of motion planning techniques to the untangling of mathematical knots. Knot untangling can be viewed as a high-dimensional planning problem in reparametrizable configuration spaces. In the past, simulated annealing and other energy minimization methods have been used to find knot untangling paths. We have developed a probabilistic planner that is cap...
Publication date: 1999